Adaptive Prosody Modelling for Improved Synthetic Speech Quality
Abstract
Neural networks and fuzzy logic have each proven efficient when applied individually to a variety of domain-specific problems, but their precision improves when they are hybridized. This contribution presents a combined framework for improving the accuracy of prosodic models. It adopts the Adaptive Neuro-Fuzzy Inference System (ANFIS) to offer self-tuned cognitive-learning capabilities suited to predicting the imprecise nature of speech prosody. After the Fuzzy Inference System (FIS) structure was initialized, the model was trained on an Ibibio (ISO 639-3: nic; Ethnologue: IBB) speech dataset using gradient descent and a non-negative least squares estimator (LSE) to demonstrate the feasibility of the proposed model. The model was then validated using a synthesized speech corpus dataset of fundamental frequency (F0) values of Ibibio tones, captured at various contour positions (initial, mid, final) within the corpus. The results showed an insignificant difference between the predicted output and the check dataset, with a checking error of 0.0412, which supports our claim that the proposed model is satisfactory and suitable for improving prosody prediction of synthetic speech.
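The hybrid learning idea mentioned in the abstract can be illustrated with a minimal sketch of a first-order Sugeno-type ANFIS: a forward pass through Gaussian membership functions and the least-squares step that fits the linear consequent parameters while the premise parameters stay fixed. Everything here is an assumption made for illustration and not the authors' implementation: the function names, the Gaussian membership shape, the grid of rules, and the use of ordinary least squares (`np.linalg.lstsq`) in place of the non-negative estimator named in the abstract. The gradient-descent update of the premise parameters is omitted.

```python
import itertools
import numpy as np


def gaussian_mf(x, centers, sigmas):
    """Membership grades of shape (n_samples, n_mfs) for one input variable."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / sigmas[None, :]) ** 2)


def normalized_firing_strengths(X, centers, sigmas):
    """Layers 1-3: fuzzify inputs, multiply grades per rule, normalize.

    X: (n_samples, n_inputs); centers, sigmas: (n_inputs, n_mfs).
    Returns (n_samples, n_rules), one rule per combination of membership functions.
    """
    n_inputs, n_mfs = centers.shape
    grades = [gaussian_mf(X[:, i], centers[i], sigmas[i]) for i in range(n_inputs)]
    rules = list(itertools.product(range(n_mfs), repeat=n_inputs))
    w = np.stack(
        [np.prod([grades[i][:, m] for i, m in enumerate(rule)], axis=0) for rule in rules],
        axis=1,
    )
    return w / w.sum(axis=1, keepdims=True)


def design_matrix(X, w_norm):
    """Layer 4 regressors: each rule's normalized strength times [x, 1]."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])  # (n_samples, n_inputs + 1)
    return (w_norm[:, :, None] * X1[:, None, :]).reshape(X.shape[0], -1)


def fit_consequents(X, y, centers, sigmas):
    """Least-squares step: solve for consequent parameters with premises fixed.

    Assumption: ordinary least squares, not the non-negative variant.
    """
    A = design_matrix(X, normalized_firing_strengths(X, centers, sigmas))
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta


def predict(X, centers, sigmas, theta):
    """Layer 5: sum of rule outputs weighted by normalized firing strength."""
    return design_matrix(X, normalized_firing_strengths(X, centers, sigmas)) @ theta


def rmse(y_true, y_pred):
    """Root-mean-square error, one common way to report a checking error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

In a setting like the one described, `X` might hold encoded tone and contour-position features and `y` the F0 values, with `rmse` computed on a held-out check set; that mapping is a guess, since the abstract does not specify the input encoding.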
Similar resources
On Customizing Prosody in Speech Synthesis: Names and Addresses as a Case in Point
This work assesses the contribution of domain-specific prosodic modelling to synthetic speech quality in a name-and-address information service. A prosodic processor analyzes the textual structure of labelled input strings, and inserts markers which specify the intended prosody for the DECtalk text-to-speech synthesizer. These markers impose discourse-level prosodic organization, annotate the i...
Weighted neural network ensemble models for speech prosody control
In text-to-speech synthesis systems, the quality of the predicted prosody contours influences quality and naturalness of synthetic speech. This paper presents a new statistical model for prosody control that combines an ensemble learning technique using neural networks as base learners with feature relevance determination. This weighted neural network ensemble model was applied for both, phone ...
Improving the naturalness of synthetic speech by utilizing the prosody of natural speech
The quality of synthetic speech is greatly improved if the prosody of natural speech is adopted instead of a rule-based prosody. In order to apply this effect to arbitrary word synthesis, the authors proposed a new prosody control method. According to the results of a listening test, rhythm could be controlled independently from pitch and power, whereas pitch and power should b...
Perceptual Evaluation of Quality Deterioration Owing to Prosody Modification
Our research goal is to construct a Japanese TTS (Text-to-Speech) system that can output various kinds of prosody. Since such synthetic speech is useful in practice, many TTS systems have implemented global prosodic control processing. But fundamentally they are designed to output speech with standard pitch and speech rate. We discuss a synthesis method for high-quality speech with extrem...
Evolutionary optimization of an adaptive prosody model
The perceived quality of synthetic speech strongly depends on its prosodic naturalness. Concerning the control of duration and fundamental frequency in a speech synthesis system, sophisticated models have been developed during the last decade. Departing from the syllable-based, adaptive prosody model IGM the authors surveyed a novel evolutionary approach to optimize the model structure itself a...